Statistical and Machine Learning

Lab2: Regularization
Ridge, LASSO, and Elastic-net Regression

Tsai, Dai-Rong

Dataset

Prostate Cancer Data

# Set random seed
set.seed(1234)

# Packages
library(MASS) # for stepAIC
library(glmnet) # for glmnet, cv.glmnet
library(sparsegl) # for sparsegl, cv.sparsegl

# Data
prostate <- read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
  • Response
    • lpsa: log(prostate-specific antigen)
  • Predictors
    • lcavol: log(cancer volume)
    • lweight: log(prostate weight)
    • age: age
    • lbph: log(amount of benign prostatic hyperplasia)
    • svi: seminal vesicle invasion
    • lcp: log(capsular penetration)
    • gleason: Gleason scores
    • pgg45: percentage of Gleason scores 4 or 5
dim(prostate)
[1] 97  9
head(prostate)
    lcavol lweight age    lbph svi     lcp gleason pgg45     lpsa
1 -0.57982  2.7695  50 -1.3863   0 -1.3863       6     0 -0.43078
2 -0.99425  3.3196  58 -1.3863   0 -1.3863       6     0 -0.16252
3 -0.51083  2.6912  74 -1.3863   0 -1.3863       7    20 -0.16252
4 -1.20397  3.2828  58 -1.3863   0 -1.3863       6     0 -0.16252
5  0.75142  3.4324  62 -1.3863   0 -1.3863       6     0  0.37156
6 -1.04982  3.2288  50 -1.3863   0 -1.3863       6     0  0.76547
str(prostate)
'data.frame':   97 obs. of  9 variables:
 $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
 $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
 $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
 $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
 $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
 $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...

Create Training/Testing Partitions

  • Split the data into an 80% training set and a 20% test set
nr <- nrow(prostate)
train.id <- sample(nr, nr * 0.8)

training <- prostate[train.id, ]
testing <- prostate[-train.id, ]
  • Check dimension
dim(training)
[1] 77  9
dim(testing)
[1] 20  9
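The mechanics of the split are worth noting: sample() draws row indices without replacement, and a non-integer size (97 × 0.8 = 77.6) is truncated down to 77, which is why the partitions come out as 77 and 20. A self-contained check:

```r
# sample() draws indices without replacement, so the two partitions
# are disjoint; the fractional size 97 * 0.8 = 77.6 is truncated to 77
set.seed(1234)
nr <- 97
train.id <- sample(nr, nr * 0.8)
test.id <- setdiff(seq_len(nr), train.id)

length(train.id)                      # 77
length(test.id)                       # 20
length(intersect(train.id, test.id))  # 0
```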

Model Formulae in R

Formula | Model
Y ~ 1 | \(Y = \beta_0 + \epsilon\)
Y ~ X1 + X2 | \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\)
Y ~ X1 + X2 - 1 | \(Y = \beta_1X_1 + \beta_2X_2 + \epsilon\)
Y ~ . | \(Y = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p + \epsilon\)
Y ~ . - Xp | \(Y = \beta_0 + \beta_1X_1 + \ldots + \beta_{p-1}X_{p-1} + \epsilon\)
Y ~ X1:X2 | \(Y = \beta_0 + \beta_{12}X_1X_2 + \epsilon\)
Y ~ X1 * X2 | \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_{12}X_1X_2 + \epsilon\)
Y ~ (X1 + X2 + X3)^2 | \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_{12}X_1X_2 + \beta_{13}X_1X_3 + \beta_{23}X_2X_3 + \epsilon\) (\(\color{red}{\beta_{123}X_1X_2X_3}\) is NOT contained)
Y ~ X1 + I(X1^2) or Y ~ poly(X1, 2, raw = TRUE) | \(Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2 + \epsilon\)
Table 1: Model Formulae in R
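The expansions in Table 1 can be verified directly with model.matrix() on toy data (the variable names X1, X2, X3 here are illustrative):

```r
# Toy data frame to inspect the design matrix each formula generates
d <- data.frame(Y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6))

colnames(model.matrix(Y ~ X1 + X2, d))      # "(Intercept)" "X1" "X2"
colnames(model.matrix(Y ~ X1 * X2, d))      # adds the interaction "X1:X2"
colnames(model.matrix(Y ~ X1 + X2 - 1, d))  # no "(Intercept)" column

# (X1 + X2 + X3)^2: main effects and all pairwise interactions,
# but no three-way term X1:X2:X3
colnames(model.matrix(Y ~ (X1 + X2 + X3)^2, d))
```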

Subset Selection

stats::step()

  • This is a minimal implementation. Use stepAIC in package MASS for a wider range of object classes.
  • B. D. Ripley: step is a slightly simplified version of stepAIC in package MASS.
lm.full <- lm(lpsa ~ ., data = training)
  • Forward Selection
lm.for <- stepAIC(update(lm.full, . ~ 1),
                  scope = list(upper = lm.full),
                  direction = "forward",
                  trace = 0)
lm.for$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
lpsa ~ 1

Final Model:
lpsa ~ lcavol + lweight + svi

       Step Df Deviance Resid. Df Resid. Dev     AIC
1                              76    100.104  22.205
2  + lcavol  1  61.5899        75     38.514 -49.344
3 + lweight  1   4.5414        74     33.973 -57.005
4     + svi  1   2.7310        73     31.242 -61.458
  • Backward Selection
lm.back <- stepAIC(lm.full,
                   scope = list(lower = ~ 1),
                   direction = "backward",
                   trace = 0)
lm.back$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
    pgg45

Final Model:
lpsa ~ lcavol + lweight + svi

       Step Df Deviance Resid. Df Resid. Dev     AIC
1                              68     29.719 -55.305
2   - pgg45  1  0.02323        69     29.742 -57.245
3    - lbph  1  0.21344        70     29.956 -58.694
4     - lcp  1  0.22115        71     30.177 -60.128
5 - gleason  1  0.50342        72     30.680 -60.854
6     - age  1  0.56146        73     31.242 -61.458
  • Stepwise Selection
lm.step <- stepAIC(lm.full, scope = list(lower = ~ 1), direction = "both", trace = 0)
lm.step$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
    pgg45

Final Model:
lpsa ~ lcavol + lweight + svi

       Step Df Deviance Resid. Df Resid. Dev     AIC
1                              68     29.719 -55.305
2   - pgg45  1  0.02323        69     29.742 -57.245
3    - lbph  1  0.21344        70     29.956 -58.694
4     - lcp  1  0.22115        71     30.177 -60.128
5 - gleason  1  0.50342        72     30.680 -60.854
6     - age  1  0.56146        73     31.242 -61.458

Regularization

  • Construct Design Matrix
x <- model.matrix(lpsa ~ ., data = training)[, -1]
y <- training$lpsa
  • "lpsa ~ .": Build a design matrix using all variables except lpsa as predictors.
  • "[, -1]": Drop the intercept column, since glmnet fits the intercept by default.
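model.matrix() is used here (rather than as.matrix()) because glmnet only accepts numeric matrices: model.matrix() expands any factor predictors into dummy columns. A small illustration with a hypothetical factor g:

```r
# model.matrix() expands factors into dummy columns, which glmnet needs
d <- data.frame(y  = rnorm(4),
                x1 = rnorm(4),
                g  = factor(c("a", "b", "c", "a")))

model.matrix(y ~ ., data = d)[, -1]  # columns: x1, gb, gc
```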

Ridge / Lasso Regression

ridge <- glmnet(x = x, y = y, alpha = 0) # Ridge
lasso <- glmnet(x = x, y = y, alpha = 1) # Lasso
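glmnet solves both problems numerically over a grid of \(\lambda\), but ridge (unlike the lasso) also has a closed form, \(\hat\beta = (X^\top X + \lambda I)^{-1}X^\top y\) for a no-intercept model. A base-R sketch on simulated data; note that glmnet scales the loss by \(1/n\) and standardizes internally, so its coefficients will not match this parameterization exactly:

```r
set.seed(1)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))
y <- X %*% c(1, -1, 0.5, 0) + rnorm(n)

lambda <- 2
beta.ols   <- solve(crossprod(X), crossprod(X, y))               # OLS
beta.ridge <- solve(crossprod(X) + lambda * diag(p),
                    crossprod(X, y))                             # closed-form ridge

# Ridge shrinks the coefficient vector toward zero (smaller L2 norm)
sum(beta.ridge^2) < sum(beta.ols^2)  # TRUE
```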

Arguments

  • family:
    • One of the built-in families: "gaussian"(default), "binomial", "poisson", "multinomial", "cox", "mgaussian".
    • A glm() family object. (See ?family)
Code for plots
par(mfrow = c(2, 2))
plot(ridge, xvar = "lambda", label = TRUE)
plot(ridge, xvar = "norm", label = TRUE)
plot(lasso, xvar = "lambda", label = TRUE)
plot(lasso, xvar = "norm", label = TRUE)
title("Ridge Regression", line = -2, outer = TRUE)
title("Lasso Regression", line = -22, outer = TRUE)

  • Coefficients of Ridge
coef(ridge, s = exp(seq(-2, 6, 2)))
9 x 5 sparse Matrix of class "dgCMatrix"
                  s1      s2     s3      s4      s5
(Intercept) -0.56814 -0.0446 1.3459 2.24135 2.4e+00
lcavol       0.48574  0.2744 0.0835 0.01455 2.1e-03
lweight      0.61679  0.4190 0.1317 0.02252 3.2e-03
age         -0.01271 -0.0023 0.0015 0.00041 6.2e-05
lbph         0.03367  0.0235 0.0102 0.00192 2.8e-04
svi          0.54370  0.3916 0.1570 0.02952 4.2e-03
lcp          0.00237  0.0796 0.0450 0.00894 1.3e-03
gleason      0.12456  0.0987 0.0549 0.01151 1.7e-03
pgg45        0.00063  0.0018 0.0015 0.00032 4.7e-05
  • Coefficients of Lasso
coef(lasso, s = exp(seq(-5, -1, 1)))
9 x 5 sparse Matrix of class "dgCMatrix"
                  s1     s2      s3   s4   s5
(Intercept) -0.62039 -0.495 -0.3921 0.42 1.90
lcavol       0.56882  0.553  0.5485 0.53 0.43
lweight      0.64248  0.615  0.5340 0.36 .   
age         -0.01557 -0.011 -0.0024 .    .   
lbph         0.03471  0.021  .      .    .   
svi          0.57973  0.499  0.4468 0.30 .   
lcp         -0.04098  .      .      .    .   
gleason      0.12856  0.092  0.0403 .    .   
pgg45        0.00013  .      .      .    .   

Cross-Validation

  • CV for Ridge
ridge.cv <- cv.glmnet(x = x, y = y, alpha = 0,
                      type.measure = "mse",
                      nfolds = 10)
ridge.cv

Call:  cv.glmnet(x = x, y = y, type.measure = "mse", nfolds = 10, alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min  0.089   100   0.512 0.0677       8
1se  0.760    77   0.573 0.0992       8
plot(ridge.cv)

coef(ridge.cv, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) -0.58941632
lcavol       0.51296312
lweight      0.63098503
age         -0.01431908
lbph         0.03608876
svi          0.56626022
lcp         -0.01742483
gleason      0.12840575
pgg45        0.00064349
  • CV for Lasso
lasso.cv <- cv.glmnet(x = x, y = y, alpha = 1,
                      type.measure = "mse",
                      nfolds = 10)
lasso.cv

Call:  cv.glmnet(x = x, y = y, type.measure = "mse", nfolds = 10, alpha = 1) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.0959    25   0.471 0.0409       3
1se 0.1839    18   0.511 0.0594       3
plot(lasso.cv)

coef(lasso.cv, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                 s1
(Intercept) 0.12655
lcavol      0.54269
lweight     0.42974
age         .      
lbph        .      
svi         0.37431
lcp         .      
gleason     .      
pgg45       .      

Arguments

  • type.measure: loss to use for cross-validation.
    • "default": MSE for gaussian models, deviance for logistic and Poisson regression, and partial likelihood for the Cox model.
    • "mse"/"mae": Mean squared/absolute error for all models except the Cox model.
    • "deviance": Deviance for logistic and Poisson regression.
    • "class": Misclassification error for binomial and multinomial logistic regression.
    • "auc": Area under the ROC curve for two-class logistic regression.
    • "C": Harrell's concordance measure for Cox models.
  • s: Value(s) of the penalty parameter \(\lambda\).
    • "lambda.1se" (default): Largest value of \(\lambda\) such that error is within 1 standard error of the minimum.
    • "lambda.min": Value of \(\lambda\) that gives the minimum mean cross-validated error.
    • numeric vector: Value(s) of \(\lambda\) to be used
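The one-standard-error rule can be reproduced by hand from the CV curve: starting from the \(\lambda\) with minimum error, take the largest \(\lambda\) whose error is still within one SE of that minimum. A base-R sketch with made-up cvm/cvsd values:

```r
# Hypothetical CV curve: mean error (cvm) and its SE (cvsd) at each lambda
lambda <- exp(seq(2, -4, length.out = 7))   # decreasing sequence
cvm  <- c(1.30, 1.05, 0.80, 0.55, 0.50, 0.52, 0.56)
cvsd <- c(0.20, 0.18, 0.15, 0.10, 0.08, 0.09, 0.10)

i.min <- which.min(cvm)
lambda.min <- lambda[i.min]
# lambda.1se: largest lambda whose error is within 1 SE of the minimum
lambda.1se <- max(lambda[cvm <= cvm[i.min] + cvsd[i.min]])

lambda.1se >= lambda.min  # TRUE: the 1-SE rule picks a more penalized model
```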

Elastic-Net

  • CV for Elastic-Net
foldid <- sample(cut(1:nrow(training), 10, labels = FALSE))
alpha.grid <- seq(0.1, 0.9, 0.2)

elnet.cv <- do.call(rbind,
  lapply(alpha.grid, \(a) {
    cv <- cv.glmnet(x = x, y = y, foldid = foldid, alpha = a)
    data.frame(alpha = a, cv[c("lambda", "cvm")])
  })
)
head(elnet.cv)
  alpha lambda    cvm
1   0.1 8.9435 1.3568
2   0.1 8.1490 1.3411
3   0.1 7.4251 1.3203
4   0.1 6.7655 1.2984
5   0.1 6.1644 1.2750
6   0.1 5.6168 1.2458
tail(elnet.cv)
    alpha    lambda     cvm
371   0.9 0.0021409 0.47963
372   0.9 0.0019507 0.47981
373   0.9 0.0017774 0.48000
374   0.9 0.0016195 0.48016
375   0.9 0.0014757 0.48027
376   0.9 0.0013446 0.48027

Arguments

  • foldid: an optional vector of values between 1 and nfolds identifying what fold each observation is in. If supplied, nfolds can be missing.

Users can explicitly control the fold that each observation is assigned to via the foldid argument. This is useful in using cross-validation to select a value for \(\alpha\).
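The foldid construction above deals out the labels 1 through 10 in (nearly) equal counts with cut(), then shuffles them with sample(), so every observation keeps the same fold across all values of \(\alpha\). A quick base-R check of the balance:

```r
# cut() assigns labels 1..10 in nearly equal counts; sample() shuffles them
set.seed(1234)
n <- 77  # training-set size
foldid <- sample(cut(1:n, 10, labels = FALSE))

table(foldid)        # each fold gets 7 or 8 observations
length(foldid) == n  # one fold label per observation
```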

Code for plots
library(ggplot2)

ggplot(elnet.cv, aes(x = log(lambda), y = cvm, colour = factor(alpha))) +
  geom_line() + geom_point(size = 0.5) +
  scale_color_viridis_d(name = quote(alpha)) +
  labs(x = quote(log(lambda)), y = "Cross-Validation MSE") +
  theme_bw()

  • Optimal CV Parameters
(elnet.optim <- elnet.cv[which.min(elnet.cv$cvm), ])
    alpha   lambda     cvm
343   0.9 0.028968 0.46205
elnet <- glmnet(x = x, y = y, alpha = elnet.optim$alpha, lambda = elnet.optim$lambda)
coef(elnet)
9 x 1 sparse Matrix of class "dgCMatrix"
                    s0
(Intercept) -0.4937283
lcavol       0.5496797
lweight      0.5977336
age         -0.0089177
lbph         0.0136072
svi          0.4857698
lcp          .        
gleason      0.0803757
pgg45        .        

(Sparse-) Group Lasso

(grp <- setNames(rep(1:4, each = 2), colnames(x)))
 lcavol lweight     age    lbph     svi     lcp gleason   pgg45 
      1       1       2       2       3       3       4       4 
grplasso <- sparsegl(x = x, y = y, group = grp, asparse = 0)
grplasso.cv <- cv.sparsegl(x = x, y = y, group = grp, asparse = 0, pred.loss = "mse")
grplasso.cv

Call:  cv.sparsegl(x = x, y = y, group = grp, pred.loss = "mse", asparse = 0) 

Error measure:  Mean squared error 

             lambda index   cvm   cvsd nnzero active_grps
Max.       7.27e-02     1 1.343 0.2219      0           0
lambda.1se 1.49e-02    18 0.575 0.0950      4           2
lambda.min 1.76e-03    41 0.500 0.0803      6           3
Min.       7.27e-06   100 0.526 0.0934      8           4

Arguments

  • group: A vector of consecutive integers describing the grouping of the coefficients.
  • asparse: The relative weight to put on the \(\ell_1\)-norm in sparse group lasso.
    • asparse = 0 gives the group lasso fit; asparse = 1 gives the lasso fit.
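The penalty that asparse interpolates can be written out directly: \((1-\texttt{asparse})\sum_g \sqrt{p_g}\,\lVert\beta_g\rVert_2 + \texttt{asparse}\,\lVert\beta\rVert_1\), with group weights \(\sqrt{p_g}\) as in Simon et al. A base-R sketch (the function name sgl_penalty is illustrative, not part of sparsegl):

```r
# Sparse group lasso penalty:
# (1 - asparse) * sum_g sqrt(p_g) * ||beta_g||_2  +  asparse * ||beta||_1
sgl_penalty <- function(beta, group, asparse) {
  grp.norm <- tapply(beta, group, function(b) sqrt(length(b)) * sqrt(sum(b^2)))
  (1 - asparse) * sum(grp.norm) + asparse * sum(abs(beta))
}

beta <- c(0.5, -0.2, 0, 0, 0.3, 0)
grp  <- rep(1:3, each = 2)

sgl_penalty(beta, grp, asparse = 0)  # pure group lasso penalty
sgl_penalty(beta, grp, asparse = 1)  # pure lasso (L1) penalty: 1.0
```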
coef(grplasso.cv, s = "lambda.1se")
9 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 1.757514
lcavol      0.403319
lweight     0.039859
age         .       
lbph        .       
svi         0.291706
lcp         0.079903
gleason     .       
pgg45       .       
Code for plots
library(patchwork)
p1 <- plot(grplasso, y_axis = "group")
p2 <- plot(grplasso.cv)
p1 + p2

Prediction

pred.for <- predict(lm.for, testing)
pred.back <- predict(lm.back, testing)
pred.step <- predict(lm.step, testing)

z <- model.matrix(lpsa ~ ., data = testing)[, -1]
pred.ridge <- predict(ridge.cv, newx = z, s = "lambda.min")
pred.lasso <- predict(lasso.cv, newx = z, s = "lambda.min")
pred.elnet <- predict(elnet, newx = z)

mse <- sapply(mget(ls(pattern = "^pred")), \(x) mean((x - testing$lpsa)^2))
sort(mse)
  ridge   elnet    back     for    step   lasso 
0.72073 0.76658 0.78281 0.78281 0.78281 0.82000 
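The mget(ls(pattern = "^pred")) idiom used above collects every object whose name starts with "pred" into one named list, so each model's predictions are scored without writing a separate line per model. A minimal self-contained illustration with toy predictions:

```r
truth  <- c(1, 2, 3)
pred.a <- c(1, 2, 3)  # perfect predictions
pred.b <- c(2, 3, 4)  # off by 1 everywhere

# Collect all objects named pred.* and compute each one's test MSE
mse <- sapply(mget(ls(pattern = "^pred")), function(p) mean((p - truth)^2))
sort(mse)  # pred.a = 0, pred.b = 1
```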
